Selwyn-Lloyd McPherson (selwyn.mcpherson@gmail.com)
Here, we are going to address a miniature ddPCR classification problem. Droplet Digital polymerase chain reaction (ddPCR) provides high-precision quantification of nucleic acid target sequences and has wide-ranging applications for both research and clinical diagnostics. ddPCR measures absolute quantities by counting nucleic acid molecules encapsulated in discrete, volumetrically defined water-in-oil droplet partitions.
Following PCR amplification of the nucleic acid target in the droplets, the plate containing the droplets is placed in a reader, which analyzes each droplet individually using a two-color detection system, picking up the droplets from each well of the PCR plate.
Droplets are spaced out individually for fluorescence reading by the droplet reader. Fluorescence in two channels is then measured for individual droplets. Positive droplets, which contain at least one copy of the target DNA molecule, exhibit increased fluorescence compared to negative droplets.
Data can be viewed as a 1-D plot with each droplet from a sample plotted on the graph of fluorescence intensity vs. droplet number. All positive droplets, those above the red threshold line, are scored as positive and each is assigned a value of 1. All negative droplets, those below the red threshold line, are scored as negative and each is assigned a value of 0.
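The scoring rule just described reduces to a simple threshold. As a sketch (the amplitudes and the threshold value here are made up purely for illustration):

```python
import numpy as np

# Hypothetical droplet amplitudes and threshold (illustration only)
amplitudes = np.array([812.0, 950.5, 4020.3, 790.1, 5210.8])
threshold = 2000.0

# Droplets above the threshold are positive (1), the rest negative (0)
calls = (amplitudes > threshold).astype(int)
print(calls)  # [0 0 1 0 1]
```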
Droplet Digital PCR data from a duplex experiment in which two targets are PCR amplified can also be viewed in a 2-D plot in which channel 1 fluorescence is plotted against channel 2 fluorescence (HEX or VIC) for each droplet.
First, we perform some initial analyses almost always done with any kind of data in preparation for a more rigorous statistical or machine learning pipeline. We will gather the data, spot-check some basic statistics, and begin to evaluate the data by looking at each channel individually and then the two channels combined.
Basic ingestion of simple CSV data.
# The usual suspects
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# For some demonstration down below
from scipy.stats import probplot
# Settings
fn = 'assay_results.csv'
# Read
raw_data = pd.read_csv(fn)
Sometimes it's just helpful to separate variables into more descriptive terms.
# Separate certain columns for convenience
channels = raw_data[['Ch1', 'Ch2']]
labels = raw_data[['Cluster_labels']]
# Simplify names
labels = labels.rename({'Cluster_labels': 'labels'}, axis=1)
Now, always take the opportunity to glance at our pretty new data!
raw_data.head()
| | Ch1 | Ch2 | Cluster_labels | Sample |
|---|---|---|---|---|
| 0 | 1819.749 | 3058.094 | 0 | B03 |
| 1 | 1785.329 | 2871.747 | 0 | C01 |
| 2 | 1686.828 | 2721.947 | 0 | B01 |
| 3 | 1768.297 | 3080.490 | 0 | B03 |
| 4 | 1769.638 | 3064.187 | 0 | B03 |
And of course info() to understand the data types and counts.
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93524 entries, 0 to 93523
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Ch1             93524 non-null  float64
 1   Ch2             93524 non-null  float64
 2   Cluster_labels  93524 non-null  int64
 3   Sample          93524 non-null  object
dtypes: float64(2), int64(1), object(1)
memory usage: 2.9+ MB
Pandas' describe() is a quick way to grab basic statistics. The quantiles, mean, and std, for example, may indicate that Channel 1 and Channel 2 differ in shape. Here is a simple report of Channel 2.
# describe() with extra percentiles
channels['Ch2'].describe(percentiles = [0.2, 0.4, 0.6, 0.8])
count    93524.000000
mean      4649.434191
std       3202.747541
min        502.650300
20%       2802.020000
40%       2891.678400
50%       2934.262000
60%       2985.589000
80%       8403.844200
max      18668.020000
Name: Ch2, dtype: float64
This is a third-party, pretty extensive (and computationally expensive) overview of the data, and we will manually recreate most of the analyses included. As such, I almost regret including it here because it is, essentially, very lazy. Although I trust the veracity of what is presented, it is more suited to a 10-second sanity check of the data's integrity than anything else. This is the ultimate meta-package, which personally makes me feel uncomfortable, but there are some interesting bits of insight, like the size in memory or the identification of infinite values, both of which can obviously be determined very quickly anyway. And some of the 'alerts' are pretty misleading.
On second thought, I'm going to leave it here but comment it out because I think it's more distracting than useful. If you're curious, pip install pandas_profiling and check out your favorite dataset.
# import pandas_profiling as pp
# Super profile
# pp.ProfileReport(raw_data)
This is actually quite important as we go further in our analysis. We'll find that the distribution of the labels is certainly not uniform, and likely not normally distributed.
import plotly.graph_objs as go
# Count the values
dist = raw_data['Cluster_labels'].value_counts()
# The trace
trace = go.Pie(values=(np.array(dist)), labels=dist.index)
layout = go.Layout(title='Channel Labels')
# Wrap the trace in a list
data = [trace]
# Create figure
fig = go.Figure(data, layout)
fig.show()
As we interrogate each channel separately, we may also gain a lot by some simple analyses to understand their differences.
I'll first explore visual approaches and then statistical approaches. The first appeals to intuition while the advantage of the second is rigor.
The common boxplot is well known and provides both summary statistics and data visualization. The median, first, and third quartiles are shown, along with boundaries at 1.5 times the interquartile range, as well as outliers, plotted individually. For this case, we will mean-center the data to provide a more compelling visual analysis.
plt.figure(figsize=(8, 6))
# Box plot
sns.boxplot(data=channels-channels.mean())
plt.title('Boxplot');
It looks as though Ch1 has a bit of a tail; let's follow up with our histograms.
When processing a large number of datasets which can potentially have different data distributions, we are confronted with the following considerations:
How can we get to know our data by understanding its distribution (uniform distribution, T-distribution, chi-square distribution, Cauchy distribution, etc.)?
If the distribution is multimodal, can we identify the number and nature of modes and provide more granular descriptive statistics? How can we estimate the probability density function of a new dataset?
Histograms are estimators: they approximate a density from some data. Whereas a histogram involves discrete data (individual bins or classes), a probability density function (PDF, or its kernel estimate, the KDE) involves continuous data (a smooth curve). We use the empirical distribution (i.e. the histogram) to describe our sample, and the PDF to describe the hypothesized underlying distribution.
The simplest is the familiar histogram:
plt.figure(figsize=(8, 6))
# Simple histogram
sns.histplot(data=channels, bins=50);
plt.title('Histogram');
Hmm, this looks weakly bimodal, at least. That might be interesting. How can we evaluate the tails? Let's try looking at the kernel density estimate. This will give us a density estimation using kernels dependent on an arbitrary bandwidth, which means the smoothness can be adjusted by varying the window size, much like a wavelet analysis.
Here, we plot the normalized, standard KDE (as lines) and the density without normalizing each curve to 1 (as a filled alpha layer).
# Normalized
plt.figure(figsize=(8, 6))
sns.kdeplot(data=channels);
# Without common_norm as filled
sns.kdeplot(
data=channels,
fill=True, common_norm=False, palette='crest',
alpha=.3, linewidth=0,
)
plt.title('Kernel Density Function')
Text(0.5, 1.0, 'Kernel Density Function')
The main peaks appear comparable and the minor modes are just slightly offset. Channel 2 is even looking slightly trimodal. This is important for clustering because it suggests distinct areas of data that have local coherence.
Depending on the nature of the underlying distribution, certain bandwidths may be more or less appropriate for understanding the density. Here, we try various bandwidth values. We see that with a narrow bandwidth, 0.05, the estimate turns noisy and minor modes fragment into spurious bumps. With a much smoother bandwidth, 0.3, we start to lose genuine features entirely. The happy medium may be around 0.1 (there are computational methods to quantify this, of course). Also note, behind the scenes we are using the standard Gaussian kernel, although several others are available.
fig, ax = plt.subplots()
# Downsample the data for computational purposes
data = channels['Ch2'].sample(1000)
# Histogram
ax.hist(
data,
bins=100,
label='Histogram from samples',
zorder=5,
edgecolor='k',
density=True,
alpha=0.2,
)
# Plot bandwidths
for bandwidth in [0.05, 0.1, 0.3]:
    sns.kdeplot(data, fill=True, bw_method=bandwidth, ax=ax, label=f'Bandwidth = {bandwidth}')
# Plot
plt.title('Variation in density based on bandwidth')
ax.legend(loc='upper left')
plt.show()
A Q-Q plot is a scatterplot created by plotting two sets of quantiles (or percentiles) against one another. If both sets of quantiles come from the same type of distribution, we should see the points forming a line that is roughly straight, and at a 45 degree angle. Say, for example, we would like to verify that a concocted distribution is in fact normal. Let's generate some data we think should follow a normal distribution and generate a Q-Q plot.
statsmodels's default comparison distribution is scipy.stats.distributions.norm (a standard normal). Here is some data as an example of a near-normal distribution:
import statsmodels.api as sm
# Normal distribution
a = np.random.normal(5, 5, 250)
# Analysis
# (statsmodels' default comparison distribution is scipy.stats.distributions.norm)
sm.qqplot(a,fit=True,line='45')
plt.show()
Looks like our toy data is normal-ish (only the extreme tails show a very small deviation). On the other hand, here is the normality deviance plot for Channel 1 (Channel 2 is similar). It looks far from normal. We see strong deviation from the expected quantiles for a normal distribution:
sm.qqplot(channels['Ch1'], fit=True, line='45')
plt.title('Channel 1 Normality Deviance')
plt.show()
Variables within a dataset can be correlated for lots of reasons and under many circumstances. Specifically, the covariance is the average of the product between the values from each sample, where the values have been centered (had their mean subtracted):
cov(X, Y) = sum((x - mean(X)) * (y - mean(Y))) / (n - 1)
Covariance is most interpretable when the two samples have a Gaussian or Gaussian-like distribution. This isn't necessarily the case for our dataset as a whole, as we have right-sided tails. We could certainly clip these off but, in the absence of a deep dive, let's just display the mechanics: the covariance matrix can be found easily using both NumPy and our own understanding of the meaning of covariance. Let's do both! They'd better be the same!
data = np.array(channels)
# Grab shape
N = data.shape[0]
n_dim = data.shape[1]
# Initialize covariance matrix
cov = np.zeros((n_dim, n_dim))
# Collect the means for each channel
avg = []
for i in range(n_dim):
    avg.append(sum([d[i] for d in data]) / N)
# Generate the covariance matrix
for i in range(n_dim):
    for j in range(n_dim):
        var = 0  # Accumulator
        for d in data:
            var += (d[i] - avg[i]) * (d[j] - avg[j])
        var /= (N - 1)
        cov[i, j] = var
# Our version
print('Manually Calculated Covariance:\n\n{}'.format(cov))
# Numpy Version
print('\nNumPy Covariance:\n\n{}'.format(np.cov(channels, rowvar=False, bias=False)))
Manually Calculated Covariance:

[[17795207.27483362 11681467.11044675]
 [11681467.11044675 10257591.80899627]]

NumPy Covariance:

[[17795207.27483365 11681467.11044679]
 [11681467.11044679 10257591.80899617]]
Success! The diagonal elements give the variance along each of the dimensions while the off-diagonal elements reveal the covariance between the x and y dimensions (for 2D analysis). Since the off-diagonal elements of the matrix are positive, we see that Channel 1 and Channel 2 are positively correlated.
The size, or amplitude, of the elements of the covariance matrix does not necessarily fall between -1 and 1, as is found in other methods. The magnitude depends on the actual values of the variables, so we see high values here since the values in our data can be quite large.
The covariance matrix, as well as its transforms, eigenvalues, and eigenvectors, can be used for a number of concepts in optimization (including PCA, which I discuss briefly below) as well as other fields.
After calculating the covariance matrix, we can quickly ask, what other kinds of correlation can we think about, and why are they important?
Correlation differs from covariance in that covariance illustrates how two variables vary together, whereas correlation describes how strongly the two variables are related. We use both to understand the relationships between variables but the two do differ slightly. Covariance indicates the direction of the linear relationship between variables whereas correlation measures both the strength and direction of that linear relationship. Additionally, correlation values are standardized to (-1, 1) where covariance values are not.
The Pearson correlation coefficient is also used to summarize the strength of the linear relationship between two data samples.
The Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviation of each data sample. It is the normalization of the covariance between the two variables to give an interpretable score.
Pearson's correlation coefficient is defined as:
covariance(X, Y) / (stdv(X) * stdv(Y))
The use of mean and standard deviation in the calculation suggests the need for the two data samples to have a Gaussian or Gaussian-like distribution.
The coefficient returns a value between -1 and 1 that represents the limits of correlation from a full negative correlation to a full positive correlation. A value of 0 means no correlation. Often a value below -0.5 or above 0.5 indicates a notable correlation, and values in between suggest a less notable correlation.
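Before reaching for SciPy, the definition can be verified directly with NumPy on some toy data (the variable names below are illustrative, standing in for our channel columns):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x + 0.2 * rng.normal(size=1000)

# Pearson's r: covariance divided by the product of standard deviations
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Should agree with NumPy's built-in correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
print(abs(r_manual - r_numpy) < 1e-9)  # True
```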
The pearsonr() SciPy function can be used to calculate the Pearson’s correlation coefficient between two data samples (as long as they have the same length).
We can calculate the correlation between the two variables in our assay data:
from scipy.stats import pearsonr
# Calculate correlation
corr, _ = pearsonr(channels['Ch1'], channels['Ch2'])
print('Pearsons correlation: {:.3f}'.format(corr))
Pearsons correlation: 0.865
If one of the two variables has a non-Gaussian distribution, or the relationship between the two is non-linear, we find Spearman's correlation to be of use.
As with the Pearson correlation coefficient, the scores are between -1 and 1 for perfectly negatively correlated variables and perfectly positively correlated, respectively.
Instead of calculating the coefficient using covariance and standard deviations on the samples themselves, these statistics are calculated from the relative rank of values on each sample. This is a common approach used in non-parametric statistics, e.g. statistical methods where we do not assume a distribution of the data such as Gaussian.
Spearman's correlation coefficient is defined as:
covariance(rank(X), rank(Y)) / (stdv(rank(X)) * stdv(rank(Y)))
A linear relationship between the variables is not assumed, although a monotonic relationship is assumed.
Uncertainty about the distribution and possible relationships between two variables suggests a good opportunity to calculate Spearman correlation coefficient.
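The rank-based definition is easy to check ourselves: applying Pearson's formula to the ranks reproduces Spearman's coefficient. A sketch on toy monotonic data (names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.exp(x) + 0.1 * rng.normal(size=500)  # monotonic but non-linear

# Spearman's rho is just Pearson's r computed on the ranks
r_from_ranks, _ = pearsonr(rankdata(x), rankdata(y))
r_spearman, _ = spearmanr(x, y)
print(abs(r_from_ranks - r_spearman) < 1e-9)  # True
```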
The spearmanr() SciPy function can be used with ease:
from scipy.stats import spearmanr
# Calculate correlation
corr, _ = spearmanr(channels['Ch1'], channels['Ch2'])
print('Spearmans correlation: {:.3f}'.format(corr))
Spearmans correlation: 0.747
We can also visualize data relationships using a pair plot which, again, is more interesting when trying to understand the correlations among a large set of features. But this is still insightful, as we have added our ground-truth cluster labels. As we've discussed before, the clusters are clearly coherent. The diagonal panels show each channel's individual histogram while the off-diagonal panels show the scatter plots, which allow for visual inspection.
# Pairplot
sns.pairplot(raw_data, hue='Cluster_labels', palette='dark', height=6, aspect=2)
plt.title('Channel1/Channel2 Pair Plot')
Text(0.5, 1.0, 'Channel1/Channel2 Pair Plot')
To delve further into analyses in which both channels are considered, we return to the kdeplot and recognize that the bivariate kernel density estimate can tell us a lot about the distributions of the data. Here, we actually see more structure in the density than in the ground-truth labels. If we visually inspect the pairplot from above, we certainly see more than four distinct areas; the 2D kdeplot in some ways tells the same story, though it will take a deeper dive to isolate these KDE clusters and plot them on the scatterplot.
# Simple KDE Plot (via sns)
sns.kdeplot(
data=channels,
x='Ch1',
y='Ch2',
fill=True,
)
plt.title('Channel 1 / Channel 2 Bivariate KDE')
Text(0.5, 1.0, 'Channel 1 / Channel 2 Bivariate KDE')
Notice that these blobs seem like they may be good candidates for clustering (if we didn't have the ground truth already)! That's exciting!
This also illustrates the need not just to understand how to use plotting functions to visualize data, but also to grab numerical data from the actual kernel density calculations (from scipy.stats.gaussian_kde), since we want to be able to manipulate these numbers ourselves.
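As a sketch of that numerical access, scipy.stats.gaussian_kde hands back a callable density we can evaluate on any grid (toy bimodal data here, standing in for a fluorescence channel):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Toy bimodal sample standing in for a fluorescence channel
sample = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 500)])

kde = gaussian_kde(sample)       # Gaussian kernel, Scott's-rule bandwidth
grid = np.linspace(-4, 10, 200)
density = kde(grid)              # raw density values, ours to manipulate

# Sanity check: the estimated density integrates to roughly 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 2))
```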
Another way to use Q-Q plots is to interrogate two populations of interest. We previously used them to generate probability plots characterizing one distribution against a 'standard'. Now, with qqplot_2samples, we can use the same logic to compare two arbitrary distributions of our own choosing. If we look at Channel 1 and Channel 2:
from statsmodels.graphics.gofplots import qqplot_2samples
# Figure
fig, ax = plt.subplots(figsize=(8, 6))
# 2-Sample Q-Q Plot
qqplot_2samples(channels['Ch1'],
channels['Ch2'],
xlabel='Quantiles of Channel 1',
ylabel='Quantiles of Channel 2',
line='45',
ax=ax)
# Formatting
ax.axes.xaxis.set_ticklabels([])
ax.axes.yaxis.set_ticklabels([])
plt.title('Channel 1 / Channel 2 Distribution Coherence')
Text(0.5, 1.0, 'Channel 1 / Channel 2 Distribution Coherence')
We see approximate coherence but with increasing deviance as we move towards the right. This is likely caused by the long tail, to varying degrees, of the two channels as the values increase.
Because qqplot takes into account, at its core, the probability distributions of the variables, certain differences will be accentuated. These are insights that are subtly more revealing and that we do not see when we just look at plt.scatter, for example:
import matplotlib.lines as mlines
# Quantile-quantile plot
fig, ax = plt.subplots(figsize=(8, 6))
plt.scatter(np.sort(channels['Ch1']), np.sort(channels['Ch2']))
plt.xlabel('Channel 1')
plt.ylabel('Channel 2')
# All this just for the y=x red line
line = mlines.Line2D([0, 1], [0, 1], color='red')
transform = ax.transAxes
line.set_transform(transform)
ax.add_line(line)
plt.title('Simple Scatter plot against a linear coherence')
plt.show()
plt.close()
Given that we know the ground truth cluster labels, we can check out how they are distributed against the two channels. Again, we won't be calculating anything yet (we will later); this is purely descriptive.
A very basic jointplot is made up, essentially, of three graphs: the main graph is the multivariate scatter plot that demonstrates how each variable fluctuates with the other. The second graph, located at the upper edge, illustrates the first variable's dispersion. The right edge is the same and represents the other variable's dispersion.
In addition, we color the points to illustrate their known cluster identification. What we hope to investigate, via this simple visualization, is how well the clusters are 'properly' centered around 'similar' neighborhoods. From the quick look below, we see a lot of data extending towards high Ch1/Ch2 values that look awfully distinct, despite being labeled the same. We may hypothesize that the clusters as currently identified have been selected because the majority of the mass of points is centered around the main central peak on the left-hand side, and that the variation of the data has not been fully leveraged.
# Jointplot with cluster labels
sns.jointplot(data=raw_data, x='Ch1', y='Ch2', hue='Cluster_labels')
<seaborn.axisgrid.JointGrid at 0x11d90d3d0>
By looking at the histograms, for example for Channel 1, it is obvious that we have one cluster towards the lower values. Then, having separated out the labels, we see a much more complex structure for cluster 3. A wavelet analysis (like the KDE, but more sensitive) might be able to pick out these almost six peaks. This reminds us of the KDE plots we saw above.
# sns settings are weird
sns.set(rc={'figure.figsize': (10, 6)})
# Plot
for cluster, color in zip(range(4), ['r', 'g', 'b', 'k']):
    sns.histplot(raw_data.loc[raw_data['Cluster_labels'] == cluster]['Ch1'], color=color)
There are, of course, numerous methods, both supervised and unsupervised, for classifying data. We'll discuss and implement a few here.
For purposes of validation, we separate training data from testing data. The model will be trained using the former and assessed using the latter. Here, I use a 75%/25% training/test split.
from sklearn import model_selection
# Just to standardize
X = channels
y = labels
# Create the train/test split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y,
test_size=0.25, random_state=0)
# Sanity check the number of training and testing points
print('Training set shape: {}'.format(X_train.shape))
print('Testing set shape: {}'.format(X_test.shape))
Training set shape: (70143, 2)
Testing set shape: (23381, 2)
Generally, with more complicated datasets, we try to do some dimensionality reduction in order to reduce the space of our data. If we are looking at data with many variables, we recognize that some of them may not contribute significantly to the variance and as such are not worth including in our model. Thus, the PCA technique is particularly useful in processing data where multicollinearity exists between the features/variables; we essentially want to eliminate those that are covariant with others and, in general, find a small set of features that describe the data well. The principal components are, then, certain linear combinations of the full feature set that enrich the training vectors while at the same time reducing complexity.
Since there are only two features (channels) here, I won't perform a PCA analysis, but as it's so often done, it's worth a brief mention.
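For reference only (our two channels don't need it), the call on a wider feature matrix is short. Everything below is toy data with illustrative names:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Toy data: 5 features, but the last 3 are noisy mixtures of the first 2
base = rng.normal(size=(300, 2))
X_wide = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(300, 3))])

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95).fit(X_wide)
print(pca.n_components_, 'components kept out of', X_wide.shape[1])
```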
The classifier I'll implement here is one of the classics: the KNeighborsClassifier. I will go much further into the implementation in Section C, but very briefly: to classify a new point, the algorithm finds the k labeled points nearest to it (by some distance metric, usually Euclidean) and assigns the class held by the majority of those neighbors. There is no real training phase; the 'model' is simply the stored, labeled training data, which is why KNN is often called a lazy learner. It is supervised, well-studied, and more than 70 years old.
Here's the absolute basic implementation:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# Build classifier with n_neighbors=1
classifier = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
print('Confusion Matrix\n----------------')
print(metrics.confusion_matrix(y_test, y_pred))
print('\n\nClassification Report\n---------------------')
print(metrics.classification_report(y_test, y_pred))
Confusion Matrix
----------------
[[17012 0 0 0]
[ 0 383 2 0]
[ 3 1 5577 2]
[ 0 0 1 400]]
Classification Report
---------------------
precision recall f1-score support
0 1.00 1.00 1.00 17012
1 1.00 0.99 1.00 385
2 1.00 1.00 1.00 5583
3 1.00 1.00 1.00 401
accuracy 1.00 23381
macro avg 1.00 1.00 1.00 23381
weighted avg 1.00 1.00 1.00 23381
The results seem quite good. The confusion matrix for this multiclass model is larger than the 2x2 matrix that would traditionally indicate the four true and false positives and negatives. Unlike binary classification, in our situation there are no positive or negative classes. To understand a 4x4 confusion matrix, we find TP, TN, FP and FN for each individual class. For example, if we take class 1 (the second row and column, since the classes are labeled from 0), we can find our statistics in a more general way:
(values from our generated confusion matrix):
TP (is class 1, correctly estimated as class 1) = 383
TN (is not class 1, correctly estimated as not class 1) = (17012 + 0 + 0 + 3 + 5577 + 2 + 0 + 1 + 400) = 22995
FP (is not class 1, incorrectly classified as class 1) = (0 + 1 + 0) = 1
FN (is class 1, not estimated as class 1) = (0 + 2 + 0) = 2
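These counts can be read off any confusion matrix mechanically: the diagonal entry is TP, the rest of the row is FN, the rest of the column is FP, and everything else is TN. A quick sketch using the matrix printed above:

```python
import numpy as np

# The confusion matrix from the report above
cm = np.array([[17012,   0,    0,   0],
               [    0, 383,    2,   0],
               [    3,   1, 5577,   2],
               [    0,   0,    1, 400]])

k = 1  # class 1 (second row/column)
TP = cm[k, k]
FN = cm[k, :].sum() - TP   # rest of row k
FP = cm[:, k].sum() - TP   # rest of column k
TN = cm.sum() - TP - FN - FP
print(TP, FN, FP, TN)  # 383 2 1 22995
```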
Since we have all the necessary metrics for the classes from the confusion matrix, now we can calculate the performance measures. In this case, for class 1:
- Accuracy (the number of correctly classified data instances over the total number of data instances): (TN + TP) / (TN + FP + TP + FN)
- Precision (essentially the number of true positives compared to false positives): TP / (TP + FP)
- Recall (the ability of a model to find all the relevant cases within a data set): TP / (TP + FN)
- F1-Score: 2 * (Precision * Recall) / (Precision + Recall)

Sometimes it's interesting to visualize the classes. In this case, we already know the labels, but we'll kind of 'hack' the KNeighborsClassifier to give us the opportunity to plot 'decision regions' which clearly identify not only the clusters but the 'cluster space' (allowing us to see how a point would be classified). Again, we see that a human might pick up on several other groups/classes towards the right side. In this case, we'd like to do some descriptive statistics to capture information about these other groups that certainly look like clusters, and understand what meaning they might hold.
Changing n_neighbors does change the areas of decision, but in this case only marginally. I'll include two extremes (1, which is better, and 80, which is less accurate) here for comparison's sake.
from mlxtend.plotting import plot_decision_regions
def knn_comparison(data, k):
    # Values
    x = data[['Ch1', 'Ch2']].values
    y = data['Cluster_labels'].astype(int).values
    # "Classifier"
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(x, y)
    # Plotting decision region
    plot_decision_regions(x, y, clf=classifier, legend=2)
    # Adding axes annotations
    plt.xlabel('Ch1')
    plt.ylabel('Ch2')
    plt.title('KNN with K = {}'.format(k))
    plt.show()
for i in [1, 80]:
    knn_comparison(raw_data, i)
/usr/local/lib/python3.9/site-packages/mlxtend/plotting/decision_regions.py:300: UserWarning:
You passed a edgecolor/edgecolors ('black') for an unfilled marker ('x'). Matplotlib is ignoring the edgecolor in favor of the facecolor. This behavior may change in the future.
Classification problems are the bread and butter of machine learning techniques and there is no shortage of methods.
While the KNeighborsClassifier is popular because it does not require a training period and is relatively easy to implement, it is known for being poor with datasets of high dimensionality, as each additional dimension complicates the distance calculation. It is also sensitive to noisy or missing data, and because it is distance-based, features in all dimensions should be properly normalized and standardized. Speed is also an issue, although spatial index structures such as k-d trees or ball trees (which scikit-learn uses by default) can alleviate some of these concerns.
This can be compared to other, more recent methods like BIRCH or DBSCAN. Both famously claim to be robust to noise. And while KNN is supervised, BIRCH is not. It performs hierarchical (tree-based) clustering and is apt to do well with very large datasets. It first clusters the data into small summaries, eventually accumulating more and more data. It does not directly cluster the entire dataset all at once and so is particularly useful for situations in which data is continuously incoming; that way, it is not required to fit the entire dataset into main memory.
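As a sketch, Birch in scikit-learn follows the usual fit/predict pattern; here on toy blobs rather than our assay data, purely for illustration:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Toy 2-D data standing in for the two channels
X_toy, _ = make_blobs(n_samples=1000, centers=4, cluster_std=0.6, random_state=0)

# BIRCH summarizes the data in a CF-tree, then clusters the leaf summaries
birch = Birch(n_clusters=4).fit(X_toy)
labels = birch.predict(X_toy)
print(len(set(labels)))  # 4
```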
Packages are, of course, wonderful tools but we must not lose sight of their underlying mechanisms. As is obvious from most of the above, we can often tell a visually compelling data story with 'just' the major plotting packages. But, in certain circumstances, like when we are tweaking parameters or hyper-parameters, we need to appreciate, for example, sns.kdeplot's relationship to scipy.stats.gaussian_kde or the interconnection between statsmodels.graphics.gofplots.qqplot_2samples and statsmodels.graphics.gofplots.ProbPlot.
As an example, to return to classification, we can easily generate a back-of-the-envelope K-means clustering with k=4. This clustering is far less accurate than the KNeighborsClassifier from before, and that's okay. It just goes to show how different algorithms can yield different results. The purpose of employing KMeans here is to go into a little more detail as to how the algorithm works. Here is the quick and dirty:
from sklearn.cluster import KMeans
# (random_state seeds the stochastic centroid initialization)
kmeans_classifier = KMeans(n_clusters=4, n_init=10, random_state=19)
kmeans_classifier.fit(channels)
sns.scatterplot(x = channels['Ch1'], y = channels['Ch2'], hue=kmeans_classifier.labels_)
plt.title('KMeans Clustering (k=4)')
Text(0.5, 1.0, 'KMeans Clustering (k=4)')
Notice, above we use n_init=10, the number of times the k-means algorithm will be run with different centroid seeds (this is also scikit-learn's default). We also choose k=4 (the number of clusters) simply because of our prior knowledge. But how do we explain these choices? An exploration of the actual algorithm can help us understand.
Given a 2D set of coordinates, what would be the best way of determining if there are two, or three, or four clusters? Let's say we start with two groups, arbitrarily. We can choose two of the points (at random) in our dataset as centers (or centroids) for our groups. Then, for each remaining point, we measure its (Euclidean) distance to those two points. Points become part of a cluster based on the closest distance to one or the other centroid. This is our first iteration and isn't expected to produce fantastic results.
How can we improve our clusters? The approach generally used is to repeat the process and choose different points to be the references of the groups. This second time, we won't choose them at random like we did before; instead, we choose the mean of all of the already grouped points (or smaller batch of points), ensuring that the new centroids can now be positioned towards the middle of the corresponding groups. Thus, we move the centroids towards a more mean-centered location. This is the second iteration.
As we repeat the process, we continually compute two new centroids for our groups and re-assign the points based on distance. We generally continue in this way until the new centroids are the same as those of the previous iteration, indicating convergence. In short, we've assumed the number of clusters, randomly chosen initial points, and updated centroids in each iteration until the clusters reach a stable state.
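The loop just described fits in a few lines. A minimal sketch (not a production implementation; it assumes no cluster ever goes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=50, init=None, seed=0):
    rng = np.random.default_rng(seed)
    if init is None:
        init = rng.choice(len(X), size=k, replace=False)
    centroids = X[np.asarray(init)]  # start from k chosen data points
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        new_centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return assign, centroids

# Two well-separated toy blobs; seed one centroid in each for the demo
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
assign, centroids = kmeans(X_toy, k=2, init=[0, 100])
print(sorted(np.bincount(assign).tolist()))  # [100, 100]
```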
Now, we still have the problem of choosing the value of k. There are two standard methods, the Elbow Method and the Silhouette Method.
The elbow method is naive but well studied. It can be defined:
Calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of k, and choose the k at which the decrease in WSS first begins to level off. In the plot of WSS-versus-k, this is visible as an elbow.
The squared error for each point is the square of the distance (usually Euclidean) between the point and its representation, i.e. its predicted cluster center; it is the point's deviation from the centroid, so to speak. The WSS score is the sum of these squared errors over all the points. By looking at a plot, or the values, of the sum of squared errors for a series of values of k, the optimal value is, for lack of a better term, the elbow or inflection point.
The silhouette method is similar in that it compares how similar a point is to its own cluster compared to other clusters. The silhouette coefficient is thus calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each sample.
Both metrics are used to arrive at an optimal value of k.
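Both criteria are easy to compute with scikit-learn (inertia_ is its name for the WSS). A sketch on toy blobs with a known k of 4, purely to show the mechanics:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated toy blobs
X_toy, _ = make_blobs(n_samples=600, centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
                      cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    # WSS (inertia) always decreases with k; silhouette peaks at the best k
    scores[k] = (km.inertia_, silhouette_score(X_toy, km.labels_))

best_k = max(scores, key=lambda k: scores[k][1])
print(best_k)  # 4
```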
It is important to note that while K-means does perform well on large datasets, it is often sensitive to outliers and extreme values, which inflate variability and distort the cluster means at the heart of K-means. Extreme-value and outlier analysis should be employed, especially with long tails such as we have in our assay data.